## Introduction
In this project, we are assigned three tasks:
1. Identify the factors that lead to attrition, including the top three factors that contribute to turnover.
2. Report any job-role-specific trends in the data set, which the executive leadership is interested in, along with any other interesting trends and observations from the analysis.
3. Build models to predict attrition and salary.
## Import and tidying datasets
The training set has 870 observations and no missing values. As imported it has 36 variables (9 character and 27 numeric columns). After tidying, 32 variables remain: 8 character, 10 factor, and 14 numeric.
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Length:870 Min. :1 Min. : 1.0 Min. :1.000
## Class :character 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
## Mode :character Median :1 Median :1039.0 Median :3.000
## Mean :1 Mean :1029.8 Mean :2.701
## 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :1 Max. :2064.0 Max. :4.000
## Gender HourlyRate JobInvolvement JobLevel
## Length:870 Min. : 30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median : 66.00 Median :3.000 Median :2.000
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:870 Min. :1.000 Length:870 Min. : 1081
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2840
## Mode :character Median :3.000 Mode :character Median : 4946
## Mean :2.709 Mean : 6390
## 3rd Qu.:4.000 3rd Qu.: 8182
## Max. :4.000 Max. :19999
## MonthlyRate NumCompaniesWorked Over18 OverTime
## Min. : 2094 Min. :0.000 Length:870 Length:870
## 1st Qu.: 8092 1st Qu.:1.000 Class :character Class :character
## Median :14074 Median :2.000 Mode :character Mode :character
## Mean :14326 Mean :2.728
## 3rd Qu.:20456 3rd Qu.:4.000
## Max. :26997 Max. :9.000
## PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
## Min. :11.0 Min. :3.000 Min. :1.000 Min. :80
## 1st Qu.:12.0 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80
## Median :14.0 Median :3.000 Median :3.000 Median :80
## Mean :15.2 Mean :3.152 Mean :2.707 Mean :80
## 3rd Qu.:18.0 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80
## Max. :25.0 Max. :4.000 Max. :4.000 Max. :80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000
## Median :1.0000 Median :10.00 Median :3.000 Median :3.000
## Mean :0.7839 Mean :11.05 Mean :2.832 Mean :2.782
## 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 5.000 Median : 3.000 Median : 1.000
## Mean : 6.962 Mean : 4.205 Mean : 2.169
## 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 3.000
## Max. :40.000 Max. :18.000 Max. :15.000
## YearsWithCurrManager
## Min. : 0.00
## 1st Qu.: 2.00
## Median : 3.00
## Mean : 4.14
## 3rd Qu.: 7.00
## Max. :17.00
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : chr "No" "No" "No" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Sales" ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : chr "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : chr "Male" "Male" "Male" "Female" ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : chr "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : chr "Divorced" "Single" "Single" "Married" ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "No" "No" "No" "No" ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
| Data summary | |
|---|---|
| Name | training_data |
| Number of rows | 870 |
| Number of columns | 36 |
| Column type frequency: | |
| character | 9 |
| numeric | 27 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| BusinessTravel | 0 | 1 | 10 | 17 | 0 | 3 | 0 |
| Department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| EducationField | 0 | 1 | 5 | 16 | 0 | 6 | 0 |
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| JobRole | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| MaritalStatus | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| Over18 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| OverTime | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 435.50 | 251.29 | 1 | 218.25 | 435.5 | 652.75 | 870 | ▇▇▇▇▇ |
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.00 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.50 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1029.83 | 604.79 | 1 | 477.25 | 1039.0 | 1561.50 | 2064 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.00 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.50 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.00 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.00 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
## Removing unnecessary columns from the training set and converting all categorical variables to factors
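The tidying step can be sketched as follows. The report does not show which columns were dropped, so this sketch assumes the candidates are the constant columns (the summary above shows a single value for `EmployeeCount`, `Over18`, and `StandardHours`) plus the `ID` row identifier. The small data frame is an illustrative stand-in, not the real training set.

```r
# Small illustrative stand-in for the training data (the real one has 870 rows).
training_data <- data.frame(
  ID = 1:4,
  Age = c(32L, 40L, 35L, 24L),
  Attrition = c("No", "No", "Yes", "No"),
  Over18 = c("Y", "Y", "Y", "Y"),
  EmployeeCount = c(1L, 1L, 1L, 1L),
  StandardHours = c(80L, 80L, 80L, 80L),
  JobLevel = c(2L, 5L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Columns with a single distinct value carry no information for a model.
is_constant <- vapply(training_data,
                      function(col) length(unique(col)) == 1,
                      logical(1))
tidy_data <- training_data[, !is_constant, drop = FALSE]

# ID is only a row identifier, so drop it as well (an assumption here).
tidy_data$ID <- NULL

# Convert every remaining character column to a factor.
is_char <- vapply(tidy_data, is.character, logical(1))
tidy_data[is_char] <- lapply(tidy_data[is_char], factor)

# JobLevel is an ordinal code; storing it as a factor is what produces
# separate JobLevel2 ... JobLevel5 coefficients in a regression.
tidy_data$JobLevel <- factor(tidy_data$JobLevel)

str(tidy_data)
```

Detecting constant columns programmatically avoids hard-coding their names, which may help if the same cleaning is applied to a second file.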
## Attrition n
## 1 No 29
## 2 Yes 6
## 3 No 487
## 4 Yes 75
## 5 No 214
## 6 Yes 59
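The counts above come in No/Yes pairs for three groups, but the grouping column is not shown in the output. As a minimal sketch, the pairs can be turned into per-group attrition rates; the group labels `g1` to `g3` are placeholders, not names from the data.

```r
# Counts copied from the output above, in the order printed.
# The grouping column is not shown there, so groups are labelled g1..g3.
counts <- data.frame(
  group = rep(c("g1", "g2", "g3"), each = 2),
  Attrition = rep(c("No", "Yes"), times = 3),
  n = c(29, 6, 487, 75, 214, 59)
)

# Cross-tabulate: one row per group, one column per attrition outcome.
tab <- xtabs(n ~ group + Attrition, data = counts)

# Row-wise proportions give the attrition rate within each group.
rates <- prop.table(tab, margin = 1)
round(rates[, "Yes"], 3)

# Overall attrition rate across all 870 employees.
overall_rate <- sum(tab[, "Yes"]) / sum(tab)
overall_rate
```

Comparing each group's rate with the overall rate (140 of 870 employees left) is a quick way to see which groups stand out.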
Attrition appears to follow a quadratic trend with age: it is high in the late teens and early 20s, levels off in the 30s, and starts picking back up in the 50s.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
## Attrition vs. JobSatisfaction
There appears to be a strong association between JobSatisfaction and the attrition rate: the higher the job satisfaction, the lower the likelihood of attrition.
## Attrition vs. Total Working Years
Similar to age, the likelihood of attrition appears to decrease as total working years increase.
## Warning: Removed 10 rows containing missing values (geom_point).
## Attrition vs. Job Role
Sales representatives appear to have a much higher attrition rate than other job roles.
## Attrition vs. PercentSalaryHike
There is only a very weak correlation between percent salary hike and attrition.
## Attrition vs. Hourly Rate
There does not appear to be any real correlation between hourly rate and attrition.
## Warning: Removed 12 rows containing missing values (geom_point).
## Attrition vs. OverTime
Working overtime appears to have a significant impact on the attrition rate.
## Attrition vs. Monthly Income
## Warning: Removed 813 rows containing missing values (geom_point).
## Testing Bayes models
The models below are built from the factors that had the most impact on attrition (Age, Job Satisfaction, Job Role, TotalWorkingYears, and Hourly Rate) in order to find the best model.
## Confusion Matrix and Statistics
##
##
## No Yes
## No 225 2
## Yes 32 2
##
## Accuracy : 0.8697
## 95% CI : (0.8227, 0.9081)
## No Information Rate : 0.9847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.08
##
## Mcnemar's Test P-Value : 6.577e-07
##
## Sensitivity : 0.87549
## Specificity : 0.50000
## Pos Pred Value : 0.99119
## Neg Pred Value : 0.05882
## Prevalence : 0.98467
## Detection Rate : 0.86207
## Detection Prevalence : 0.86973
## Balanced Accuracy : 0.68774
##
## 'Positive' Class : No
##
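The statistics in the output above follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand. It assumes the rows are predictions and the columns are the reference values, which is how caret's `confusionMatrix` reads a table, with "No" as the positive class. One thing that may be worth checking: a prevalence of 0.985 for "No" is much higher than the share of "No" in the attrition counts shown earlier (730 of 870), which could mean the prediction and reference arguments were passed in the opposite order.

```r
# First confusion matrix from the output above.
# Rows are taken as predictions and columns as the reference,
# with "No" as the positive class.
cm <- matrix(c(225, 32, 2, 2), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference = c("No", "Yes")))

tp <- cm["No", "No"]    # predicted No, reference No
fn <- cm["Yes", "No"]   # predicted Yes, reference No
fp <- cm["No", "Yes"]   # predicted No, reference Yes
tn <- cm["Yes", "Yes"]  # predicted Yes, reference Yes

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of reference "No" predicted as "No"
specificity <- tn / (tn + fp)   # share of reference "Yes" predicted as "Yes"
prevalence  <- (tp + fn) / sum(cm)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, prevalence = prevalence,
        balanced_accuracy = balanced), 4)
```

Because only four cases fall in the "Yes" reference column, the specificity is based on very few observations and can swing widely between splits.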
## [1] 0.842069
## [1] 0.002061901
## [1] 0.8508076
## [1] 0.002113658
## [1] 0.6040192
## [1] 0.002113658
## [1] 0.8407663
## [1] 0.00207462
## [1] 0.8507683
## [1] 0.002080485
## [1] 0.5916438
## [1] 0.002080485
## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962
## Confusion Matrix and Statistics
##
##
## No Yes
## No 214 4
## Yes 38 5
##
## Accuracy : 0.8391
## 95% CI : (0.7888, 0.8815)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1435
##
## Mcnemar's Test P-Value : 3.543e-07
##
## Sensitivity : 0.8492
## Specificity : 0.5556
## Pos Pred Value : 0.9817
## Neg Pred Value : 0.1163
## Prevalence : 0.9655
## Detection Rate : 0.8199
## Detection Prevalence : 0.8352
## Balanced Accuracy : 0.7024
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 211 7
## Yes 37 6
##
## Accuracy : 0.8314
## 95% CI : (0.7804, 0.8748)
## No Information Rate : 0.9502
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1492
##
## Mcnemar's Test P-Value : 1.232e-05
##
## Sensitivity : 0.8508
## Specificity : 0.4615
## Pos Pred Value : 0.9679
## Neg Pred Value : 0.1395
## Prevalence : 0.9502
## Detection Rate : 0.8084
## Detection Prevalence : 0.8352
## Balanced Accuracy : 0.6562
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 226 1
## Yes 34 0
##
## Accuracy : 0.8659
## 95% CI : (0.8185, 0.9048)
## No Information Rate : 0.9962
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0075
##
## Mcnemar's Test P-Value : 6.338e-08
##
## Sensitivity : 0.8692
## Specificity : 0.0000
## Pos Pred Value : 0.9956
## Neg Pred Value : 0.0000
## Prevalence : 0.9962
## Detection Rate : 0.8659
## Detection Prevalence : 0.8697
## Balanced Accuracy : 0.4346
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 223 4
## Yes 29 5
##
## Accuracy : 0.8736
## 95% CI : (0.827, 0.9113)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1883
##
## Mcnemar's Test P-Value : 2.943e-05
##
## Sensitivity : 0.8849
## Specificity : 0.5556
## Pos Pred Value : 0.9824
## Neg Pred Value : 0.1471
## Prevalence : 0.9655
## Detection Rate : 0.8544
## Detection Prevalence : 0.8697
## Balanced Accuracy : 0.7202
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 224 3
## Yes 30 4
##
## Accuracy : 0.8736
## 95% CI : (0.827, 0.9113)
## No Information Rate : 0.9732
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1577
##
## Mcnemar's Test P-Value : 6.011e-06
##
## Sensitivity : 0.8819
## Specificity : 0.5714
## Pos Pred Value : 0.9868
## Neg Pred Value : 0.1176
## Prevalence : 0.9732
## Detection Rate : 0.8582
## Detection Prevalence : 0.8697
## Balanced Accuracy : 0.7267
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 222 2
## Yes 32 5
##
## Accuracy : 0.8697
## 95% CI : (0.8227, 0.9081)
## No Information Rate : 0.9732
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1908
##
## Mcnemar's Test P-Value : 6.577e-07
##
## Sensitivity : 0.8740
## Specificity : 0.7143
## Pos Pred Value : 0.9911
## Neg Pred Value : 0.1351
## Prevalence : 0.9732
## Detection Rate : 0.8506
## Detection Prevalence : 0.8582
## Balanced Accuracy : 0.7942
##
## 'Positive' Class : No
##
## [1] 0.8406897
## [1] 0.002048626
## [1] 0.8509524
## [1] 0.002089431
## [1] 0.57255
## [1] 0.002089431
The best Bayes model includes Age, JobRole, JobSatisfaction, and OverTime. It has an accuracy of 0.849, a sensitivity of 0.859, and a specificity of 0.643, with "No" as the positive class.
## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962
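## Comparing against the KNN model
The KNN results are as follows:

| Quantity | Value |
|---|---|
| mean(AccHolder) | 0.8330268 |
| sd(AccHolder)/sqrt(100) | 0.00218046 |
| mean(SensHolder) | 0.8517895 |
| sd(SensHolder)/sqrt(100) | 0.002144115 |
| mean(SpecHolder) | 0.4511064 |
| sd(SensHolder)/sqrt(100) | 0.002144115 |

The last row is computed from SensHolder, so it repeats the sensitivity standard error rather than giving the standard error of the specificity.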
## integer(0)
## [1] NA
## [1] 0.8330268
## [1] 0.00218046
## [1] 0.8517895
## [1] 0.002144115
## [1] 0.4511064
## [1] 0.002144115
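A minimal sketch of the summary step, assuming each holder vector stores one value per train/test split and that 100 splits were run (the `sqrt(100)` in the labels suggests this). The vectors here are filled with random placeholder values, not the report's results. Using one helper for all three metrics keeps each standard error tied to its own vector.

```r
set.seed(1)

# Hypothetical stand-ins for the per-iteration results; in the analysis
# these would hold one value per train/test split.
iterations <- 100
AccHolder  <- runif(iterations, 0.80, 0.88)
SensHolder <- runif(iterations, 0.82, 0.90)
SpecHolder <- runif(iterations, 0.30, 0.70)

# Standard error of the mean: sd divided by sqrt of the sample size.
std_error <- function(x) sd(x) / sqrt(length(x))

# One row per metric, each with its own mean and standard error.
results <- data.frame(
  metric = c("accuracy", "sensitivity", "specificity"),
  mean = c(mean(AccHolder), mean(SensHolder), mean(SpecHolder)),
  se = c(std_error(AccHolder), std_error(SensHolder), std_error(SpecHolder))
)
results
```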
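The best Bayes model (Age, JobRole, JobSatisfaction, and OverTime) outperforms the KNN model on accuracy (0.849 vs. 0.833), sensitivity (0.859 vs. 0.852), and specificity (0.643 vs. 0.451).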
## EDA for imputing Monthly Income
So far, JobLevel has the highest correlation with monthly income (0.952), followed by TotalWorkingYears (0.779), YearsAtCompany (0.491), Age (0.485), and YearsSinceLastPromotion (0.316).
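One way to rank candidate predictors is to correlate every numeric column with the response in a single call. The sketch below uses synthetic data with the report's column names; the generated values and the resulting correlations are illustrative only and are not the figures quoted above.

```r
set.seed(42)

# Synthetic stand-in data (not the real training set): income is driven
# mainly by job level, with experience and age loosely related to it.
n <- 200
TotalWorkingYears <- round(runif(n, 0, 40))
Age <- 18 + TotalWorkingYears + round(runif(n, 0, 5))
JobLevel <- pmin(5, 1 + TotalWorkingYears %/% 8)
MonthlyIncome <- 2500 + 3500 * (JobLevel - 1) + rnorm(n, sd = 1200)
demo <- data.frame(Age, TotalWorkingYears, JobLevel, MonthlyIncome)

# Pearson correlation of every numeric column with MonthlyIncome.
num_cols <- vapply(demo, is.numeric, logical(1))
cors <- cor(demo[, num_cols])[, "MonthlyIncome"]

# Drop the self-correlation and sort from strongest to weakest.
cors <- sort(cors[names(cors) != "MonthlyIncome"], decreasing = TRUE)
round(cors, 3)
```

Pearson correlation treats JobLevel as a number here; once it is stored as a factor it has to be converted back (or compared with a different measure) before calling `cor`.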
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## First model for computing Monthly Income
The first model, which predicts MonthlyIncome from JobLevel alone, has an RMSE of 1410.878.
## `geom_smooth()` using formula 'y ~ x'
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4642.2 -668.0 -107.3 668.3 4412.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2743.82 69.69 39.37 <2e-16 ***
## JobLevel2 2800.46 99.89 28.04 <2e-16 ***
## JobLevel3 7108.38 130.24 54.58 <2e-16 ***
## JobLevel4 12509.83 177.45 70.50 <2e-16 ***
## JobLevel5 16480.15 219.18 75.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1264 on 865 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9244
## F-statistic: 2658 on 4 and 865 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 2607.044 2880.604
## JobLevel2 2604.402 2996.509
## JobLevel3 6852.766 7363.996
## JobLevel4 12161.551 12858.101
## JobLevel5 16049.957 16910.342
## [1] 1216.151
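Because JobLevel enters the model as a factor, the intercept is the fitted income for JobLevel 1 and each `JobLevelK` coefficient is an offset from that baseline. The sketch below works through this arithmetic with the coefficients from the summary output of the first model; the variable names are chosen here for illustration.

```r
# Coefficients copied from the first model's summary output.
# JobLevel 1 is the reference level, so its fitted mean is the intercept.
coefs <- c("(Intercept)" = 2743.82,
           JobLevel2 = 2800.46,
           JobLevel3 = 7108.38,
           JobLevel4 = 12509.83,
           JobLevel5 = 16480.15)

# Fitted income for each level = intercept + that level's offset.
fitted_income <- c(JobLevel1 = unname(coefs["(Intercept)"]),
                   coefs[-1] + unname(coefs["(Intercept)"]))
fitted_income
```

With a single categorical predictor, these fitted values are simply the mean MonthlyIncome within each job level.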
## Second model: adding TotalWorkingYears
Adding TotalWorkingYears reduces the error, giving an RMSE of 1365.
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4957.9 -657.8 -134.6 618.2 4525.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2544.901 89.085 28.567 < 2e-16 ***
## JobLevel2 2652.205 107.666 24.634 < 2e-16 ***
## JobLevel3 6820.371 152.732 44.656 < 2e-16 ***
## JobLevel4 11858.212 254.564 46.582 < 2e-16 ***
## JobLevel5 15800.546 289.997 54.485 < 2e-16 ***
## TotalWorkingYears 33.442 9.426 3.548 0.000409 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 864 degrees of freedom
## Multiple R-squared: 0.9258, Adjusted R-squared: 0.9254
## F-statistic: 2157 on 5 and 864 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 2370.05330 2719.74815
## JobLevel2 2440.88699 2863.52254
## JobLevel3 6520.60145 7120.13957
## JobLevel4 11358.57533 12357.84882
## JobLevel5 15231.36546 16369.72723
## TotalWorkingYears 14.94155 51.94211
## [1] 1203.668
## Third model: adding Age as well
Using the predictors most strongly correlated with MonthlyIncome (JobLevel, Age, and TotalWorkingYears) gave the lowest RMSE, around 1200. Note that the Age coefficient is not statistically significant in this model (p = 0.4716).
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + Age + TotalWorkingYears,
## data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4947.4 -652.9 -136.8 615.3 4542.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2417.795 197.719 12.228 <2e-16 ***
## JobLevel2 2653.840 107.720 24.636 <2e-16 ***
## JobLevel3 6825.867 152.965 44.624 <2e-16 ***
## JobLevel4 11869.515 255.118 46.525 <2e-16 ***
## JobLevel5 15811.576 290.482 54.432 <2e-16 ***
## Age 4.548 6.316 0.720 0.4716
## TotalWorkingYears 29.545 10.871 2.718 0.0067 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 863 degrees of freedom
## Multiple R-squared: 0.9259, Adjusted R-squared: 0.9254
## F-statistic: 1797 on 6 and 863 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 2029.728420 2805.86250
## JobLevel2 2442.416087 2865.26407
## JobLevel3 6525.639779 7126.09387
## JobLevel4 11368.789547 12370.24015
## JobLevel5 15241.442578 16381.70959
## Age -7.847635 16.94390
## TotalWorkingYears 8.209551 50.88143
## [1] 1203.884
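The three candidate formulas can be compared on the same held-out rows. The report does not show how its RMSE values were computed, so the 70/30 split below is an assumption, and the data are synthetic stand-ins with the report's column names.

```r
set.seed(7)

# Synthetic stand-in data with the same column names as the report.
n <- 300
TotalWorkingYears <- round(runif(n, 0, 40))
Age <- 18 + TotalWorkingYears + round(runif(n, 0, 5))
JobLevel <- factor(pmin(5, 1 + TotalWorkingYears %/% 8), levels = 1:5)
MonthlyIncome <- 2500 + 3500 * (as.integer(JobLevel) - 1) +
  30 * TotalWorkingYears + rnorm(n, sd = 1200)
demo <- data.frame(Age, TotalWorkingYears, JobLevel, MonthlyIncome)

# Hold out 30% of the rows so RMSE is measured on unseen data.
train_idx <- sample(n, size = round(0.7 * n))
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

formulas <- list(
  m1 = MonthlyIncome ~ JobLevel,
  m2 = MonthlyIncome ~ JobLevel + TotalWorkingYears,
  m3 = MonthlyIncome ~ JobLevel + Age + TotalWorkingYears
)

# Fit each model on the training rows and score it on the held-out rows.
holdout_rmse <- vapply(formulas, function(f) {
  fit <- lm(f, data = train)
  rmse(test$MonthlyIncome, predict(fit, newdata = test))
}, numeric(1))

round(holdout_rmse, 1)
```

Scoring all models on one shared split makes the RMSE values directly comparable; differences of a few units between models may be within the noise of the split.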
## Predict the salary for test_salary_data using the linear model (MonthlyIncome ~ JobLevel + Age + TotalWorkingYears)
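A minimal sketch of the prediction step, using synthetic training data and three made-up rows standing in for `test_salary_data` (its real contents are not shown in this report). The main point is that JobLevel in the new data needs the same factor encoding as in the training data before calling `predict`.

```r
set.seed(3)

# Synthetic stand-in for the training data (illustrative values only).
n <- 120
TotalWorkingYears <- round(runif(n, 0, 40))
training_data <- data.frame(
  TotalWorkingYears = TotalWorkingYears,
  Age = 18 + TotalWorkingYears + round(runif(n, 0, 5)),
  JobLevel = factor(pmin(5, 1 + TotalWorkingYears %/% 8), levels = 1:5)
)
training_data$MonthlyIncome <- 2500 +
  3500 * (as.integer(training_data$JobLevel) - 1) +
  30 * training_data$TotalWorkingYears + rnorm(n, sd = 1200)

fit <- lm(MonthlyIncome ~ JobLevel + Age + TotalWorkingYears,
          data = training_data)

# Hypothetical unlabelled rows standing in for test_salary_data.
test_salary_data <- data.frame(
  ID = c(871, 872, 873),
  JobLevel = c(1L, 3L, 5L),
  Age = c(25, 41, 55),
  TotalWorkingYears = c(3, 18, 34)
)

# Apply the same factor encoding used for training before predicting.
test_salary_data$JobLevel <- factor(test_salary_data$JobLevel, levels = 1:5)

# Add the predicted income as a new column and show ID with prediction.
test_salary_data$MonthlyIncome <- predict(fit, newdata = test_salary_data)
test_salary_data[, c("ID", "MonthlyIncome")]
```

The resulting two-column data frame can be written out with `write.csv(..., row.names = FALSE)` if a file of predictions is needed.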